How do social scientists collect data
to answer the questions?

PSCI 2270 - Week 3

Georgiy Syunyaev

Department of Political Science, Vanderbilt University

September 17, 2024

Plan for this week


  1. Learning about population from sample

  2. Descriptive statistics

  3. Some math…

Plan for this week

  1. Learning about population from sample

From individual to population


  • We are usually interested in making inferences about groups of units, population

    • Example: all voters in a specific election; all conflicts; all legislators in a country
  • To do so we collect multiple individual measurements

    • Individual measurement = exact value + chance error + bias
  • Once we aggregate, the formula becomes estimate = estimand + noise + bias

    • Estimate: aggregate of individual measures
    • Estimand: population quantity of interest
    • Noise: aggregate of individual chance errors
    • Bias: aggregate systematic error
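The decomposition above can be illustrated with a small simulation. This is a minimal sketch, assuming (hypothetically) a Gaussian chance error and a constant bias shared by every measurement; the specific numbers are made up for illustration.

```python
import random

random.seed(1)

ESTIMAND = 50.0  # true population quantity of interest (assumed for illustration)
BIAS = 2.0       # systematic error shared by every measurement
NOISE_SD = 5.0   # spread of the individual chance errors

# Each individual measurement = exact value + chance error + bias
measurements = [ESTIMAND + random.gauss(0, NOISE_SD) + BIAS for _ in range(10_000)]

# Aggregating averages the chance errors toward zero, but the bias remains
estimate = sum(measurements) / len(measurements)
print(estimate)  # close to ESTIMAND + BIAS = 52, not to ESTIMAND = 50
```

The point of the sketch: more observations shrink the noise term, but they do nothing about the bias term.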

Sampling


  • We often cannot survey or measure outcome among the whole set of units we are interested in \(\Rightarrow\) Target population

    • Example: People who will vote in the next election; All news reports from Fox News
  • We then have to resort to a subset of units that we can reasonably collect data for \(\Rightarrow\) Sample

    • Example: Those who participate in the survey; News reports that we can download online on a specific date
  • We collect the sample from the available list that ideally includes the whole population \(\Rightarrow\) Sampling frame

    • Example: List of registered voters; All news reports ever published online

Learning about populations

  • Probability: characterizing the uncertainty in how we collect our data
  • Inference: learning about the population from a set of data

Sources of bias


  • Reliability and validity are still a concern

    • a small sample of unreliable measures can still be off-target (the noise has not yet averaged out)
    • any sample of non-valid measures will be off-target, no matter its size
  • In addition, sampling introduces more sources of bias:

    • Selection bias: our sampling frame is wrong
    • Non-response bias: we are more/less likely to observe certain types of units (people, documents, conflicts, etc.)

Types of sampling


  • Probability sampling: Every unit in the population has a known probability of being selected into sample
  • Simple random sampling: Every unit has an equal selection probability

    • e.g. random digit dialing (RDD):

      1. Take a particular area code + exchange: 617-495-XXXX.
      2. Randomly choose each digit in XXXX to call a particular phone.
      3. Every phone number in America has an equal chance of being included in sample.
    • …or random walk (?)

  • Quota sampling, cluster sampling, snowball sampling, etc.
  • Non-probability sampling: e.g. Opt-in Internet panels
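Simple random sampling is easy to mimic in code. A minimal sketch, assuming a toy sampling frame of 1,000 voter IDs (hypothetical): `random.sample` draws without replacement, giving every unit in the frame an equal selection probability.

```python
import random

random.seed(2270)

# A toy sampling frame: IDs of 1,000 registered voters (hypothetical)
frame = list(range(1000))

# Simple random sample of 50: every unit has an equal selection probability
sample = random.sample(frame, k=50)

print(len(sample))       # 50
print(len(set(sample)))  # 50 -- drawn without replacement, so no duplicates
```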

1936 Literary Digest Poll



  • Literary Digest predicted elections using mail-in polls

  • Primary source of addresses: Automobile registrations, phone books, country club memberships

  • In 1936, sent out 10 million ballots, over 2.3 million returned

  • George Gallup sampled 50,000 respondents from all voting age citizens

Poll’s Result: Fail!

Pollster          FDR’s Vote Share
Literary Digest   43%
George Gallup     56%
Actual Outcome    62%


  • Ballots skewed toward the wealthy (with cars, phones) \(\Rightarrow\) selection bias

    • Only 1 in 4 households had a phone in 1936.
  • In addition, people who respond could be different than those who don’t \(\Rightarrow\) non-response bias
  • When selection procedure is biased, adding more observations doesn’t help (we keep shooting off-target)

1948 Election

The Polling Disaster

Pollster         Truman   Dewey   Thurmond   Wallace
Crossley         45%      50%     2%         3%
Gallup           44%      50%     2%         4%
Roper            38%      53%     5%         4%
Actual Outcome   50%      45%     3%         2%


  • Quota sampling:

    • fixed quota of certain respondents for each interviewer
    • sample resembles the population on these characteristics
  • Most polls concluded ~2 weeks prior to the elections \(\Rightarrow\) selection bias

  • Republicans easier to interview within quotas (phones, listed addresses, etc.) \(\Rightarrow\) non-response bias

Plan for this week


  1. Learning about population from sample
  2. Descriptive statistics

What to DO with Measured Outcomes


  • A variable is a series of measurements (observations) of some concept
  • Descriptive (summary) statistics are numerical summaries of those observations

    • If we were smart enough, we wouldn’t need them: we could just look at the list of numbers and understand it completely
  • Two salient features of a variable that we want to know:

    • Central tendency: where is the middle/typical/average value
    • Spread: are all observations close to each other or spread out?

Center of the data: Mean

  • Center of the data: typical/average value
  • Mean: sum of the values divided by the number of observations

\[ \color{#98971a}{\bar{x}} = \color{#d65d0e}{\frac{1}{n}} \color{#458588}{\sum_{i = 1}^{n} x_{i}} \]

  • What’s all this notation?

    • Population value: Greek letters
    • Sample value: Latin letters or greek letters with hat
    • Miscellaneous squiggles (sums, hats, bars, subscripts)
  • Applied to the mean:

    • Population value: \(\mu\) (say mu)
    • Sample value: \(\hat{\mu}\) (say mu-hat); \(\bar{x}\) (say “x-bar”)

Center of the data: Median


  • Median: \[ \text{median} = \begin{cases} \text{middle value} & \text{if number of entries is odd} \\ \frac{\text{sum of two middle values}}{2} & \text{if number of entries is even} \end{cases} \]
  • Median more robust to outliers:

    • Example 1: \(\text{data} = \{ 0, 1, 2, 3, 5 \}\). \(\text{mean} = 2.2\), \(\text{median} = 2\)
    • Example 2: \(\text{data} = \{ 0, 1, 2, 3, 100 \}\). \(\text{mean} = 21.2\), \(\text{median} = 2\)
  • Question: What does including Elon Musk do to the mean vs. the median income? \(\Rightarrow\) the gap between them is one measure of income inequality
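The two examples above can be reproduced directly with the standard library. A minimal sketch using the same two datasets: one outlier moves the mean by an order of magnitude but leaves the median untouched.

```python
import statistics

data1 = [0, 1, 2, 3, 5]
data2 = [0, 1, 2, 3, 100]  # same data with one extreme outlier

print(statistics.mean(data1), statistics.median(data1))  # 2.2 2
print(statistics.mean(data2), statistics.median(data2))  # 21.2 2
```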

Spread of the data

  • Are the data close to the center?
  • Range: \(\left[ \min (X), \max (X) \right]\)
  • Quantile (quartile, quintile, percentile, etc):

    • 25th percentile = lower quartile (25% of the data below this value)
    • 50th percentile = median (50% of the data below this value)
    • 75th percentile = upper quartile (75% of the data below this value)
  • Interquartile range (IQR): a measure of variability

    • How spread out is the middle half of the data?
    • Is most of the data really close to the median or are the values spread out?
  • One definition of outliers: over 1.5 × IQR above the upper quartile or below lower quartile
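The quartile, IQR, and outlier definitions above can be checked on a tiny dataset. A minimal sketch, assuming the `'inclusive'` quantile method (which treats the data as the whole population; other methods give slightly different cut points).

```python
import statistics

data = [0, 1, 2, 3, 5]

# Quartiles cut the sorted data into four equal parts
q1, q2, q3 = statistics.quantiles(data, n=4, method='inclusive')
iqr = q3 - q1

print(q1, q2, q3)  # 1.0 2.0 3.0  (q2 is the median)
print(iqr)         # 2.0

# One definition of outliers: beyond 1.5 * IQR outside the quartiles
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < low or x > high]
print(outliers)    # [] -- here 5 is inside the fence at 3 + 3 = 6
```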

Standard deviation

  • Standard deviation (\(\sigma\), sd): On average, how far away are data points from the mean?

\[ \text{sd} = \color{#cc241d}{\sqrt{\color{#b16286}{\frac{1}{n - 1}} \color{#98971a}{\sum_{i = 1}^{n}} \color{#458588}{(}\color{#d65d0e}{x_i - \bar{x}}\color{#458588}{)^2} }} \]

  • Steps:

    1. Subtract the mean from each data point
    2. Square each resulting difference
    3. Take the sum of these values
    4. Divide by \(n − 1\)
    5. Take the square root
  • Variance (\(\sigma^2\), var): \(\text{Var} = \text{standard deviation}^2\)
  • Question: Why not just take the average deviations from mean without squaring?
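The five steps above translate line by line into code. A minimal sketch on the dataset \(\{0, 1, 2, 3, 5\}\) from the median slide, checked against the library function (which uses the same \(n - 1\) denominator).

```python
import math
import statistics

data = [0, 1, 2, 3, 5]
n = len(data)
xbar = sum(data) / n                   # the sample mean, 2.2

deviations = [x - xbar for x in data]  # step 1: subtract the mean
squared = [d ** 2 for d in deviations] # step 2: square each difference
total = sum(squared)                   # step 3: sum these values
variance = total / (n - 1)             # step 4: divide by n - 1, giving 3.7
sd = math.sqrt(variance)               # step 5: take the square root

print(round(sd, 4))            # 1.9235
print(statistics.stdev(data))  # the library function agrees
```

Squaring in step 2 is what the closing question is about: without it, the positive and negative deviations would cancel and the sum would always be zero.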

Plan for this week


  1. Learning about population from sample

  2. Descriptive statistics

  3. Some math…

Some Building Blocks


  • Probability:

    • Basis for understanding uncertainty in our estimates
    • Statistics is applied probability
  • Law of Large Numbers

    • Perform the same task over and over (e.g., draw a sample)
    • Average of the results converges to the truth
  • Central Limit Theorem:

    • Add up a lot of independent factors
    • Result follows the normal distribution

Large random samples


  • In real data, we will have a set of \(n\) observations of a variable: \(X_1\) , \(X_2\), … , \(X_n\)

    • \(X_1\) is the age of the first randomly selected registered voter.
    • \(X_2\) is the age of the second randomly selected registered voter, etc.
  • Empirical analyses: summary of these \(n\) observations

    • All statistical procedures involve a statistic, very often a sum or a mean.
    • What are the properties of these sums and means?
    • Can the sample mean of age tell us anything about the population distribution of age?
  • Asymptotics: what can we learn as \(n\) gets big?

Stats Lingo: LLN


Law of Large Numbers (LLN)

Let \(X_1\) , … , \(X_n\) be i.i.d. random variables with mean \(\mu\) and finite variance \(\sigma^2\). Then, \(\bar{X}_{n}\) converges to \(\mu\) as \(n\) gets large.


  • The probability of \(\bar{X}_n\) being “far away” from \(\mu\) goes to \(0\) as \(n\) gets big
  • Intuition: If we roll the six-sided die many times, what do you think the average of rolls will be?
  • Important result: The distribution of sample means “collapses” to population mean if we draw many samples
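The die-rolling intuition above is easy to simulate. A minimal sketch, assuming a fair six-sided die (true mean 3.5) and a fixed seed for reproducibility: as \(n\) grows, the average of the rolls settles near 3.5.

```python
import random

random.seed(42)

# Roll a fair six-sided die n times; the average converges to 3.5
averages = {}
for n in [10, 1000, 100_000]:
    rolls = [random.randint(1, 6) for _ in range(n)]
    averages[n] = sum(rolls) / n
    print(n, averages[n])  # the average drifts toward 3.5 as n grows
```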

Normal Distribution

  • The normal distribution is the classic “bell-shaped” curve.

    • Extremely ubiquitous in statistics
    • mean and variance follow standard notation
    • When \(X\) is distributed normally, we write \(X \sim N ( \mu, \sigma^2 )\)
  • Three key properties:

    • Unimodal: one peak at the mean
    • Symmetric around the mean
    • Everywhere positive: any real value can possibly occur

Stats Lingo: CLT


Central Limit Theorem (CLT)

Let \(X_1\), …, \(X_n\) be i.i.d. random variables with mean \(\mu\) and finite variance \(\sigma^2\). Then \(\bar{X}_n\) (the sample mean) is approximately distributed \(N ( \mu, \sigma^2 / n )\) for large \(n\).


  • Approximation is better as sample size goes up

  • Important result: We now know how far away \(\bar{X}_n\) can be from its mean!
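The CLT claim can be checked by brute force. A minimal sketch, assuming a fair die as the population (mean 3.5, variance 35/12) and a fixed seed: we draw many samples of size \(n = 100\) and look at the distribution of their means.

```python
import random
import statistics

random.seed(7)

# Population: a fair die with mean 3.5 and variance 35/12
MU, SIGMA2 = 3.5, 35 / 12
n = 100

# Draw many samples of size n and record each sample mean
means = [statistics.mean(random.randint(1, 6) for _ in range(n))
         for _ in range(10_000)]

# CLT: sample means should have mean MU and variance SIGMA2 / n
print(statistics.mean(means))      # close to 3.5
print(statistics.variance(means))  # close to (35/12)/100, about 0.029
```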

Implications of CLT/LLN


  • By CLT, sample mean \(\approx\) normal with mean \(\mu\) and sd of \(\sigma / \sqrt{n}\)
  • By the empirical rule, the sample mean will be within \(2 \times \sigma / \sqrt{n}\) of the population mean 95% of the time
  • We usually only have one sample, so we’ll only get one sample mean. So why do we care about LLN/CLT?

    • CLT gives us assurances our sample mean won’t be too far from population mean
    • CLT will also help us create measure of uncertainty for our estimates, standard error (SE):

    \[ SE = \sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}} \]
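The SE formula is computable from a single sample. A minimal sketch, assuming a hypothetical sample of ten respondents' ages; since \(\sigma\) is unknown in practice, the sample standard deviation is plugged in.

```python
import math
import statistics

# One observed sample: hypothetical ages of 10 respondents
sample = [23, 31, 44, 52, 29, 38, 61, 47, 35, 40]
n = len(sample)

# SE = sigma / sqrt(n), with the sample sd standing in for sigma
se = statistics.stdev(sample) / math.sqrt(n)

print(round(se, 2))  # 3.61
```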

Next Week



  • Think about possible data strategies for answering question: Which factors affect election participation?

  • Applying CLT/LLN to get point estimates and estimates of uncertainty

  • Comparing group means and logic of causal inference
